Abstract

Suppose that party A collects private information about its users, where each user's data is represented 
as a bit vector. Suppose that party B has a proprietary data mining algorithm that requires estimating 
the distance between users, such as clustering or nearest neighbors. We ask if it is possible for party A 
to publish some information about each user so that B can estimate the distance between users without 
being able to infer any private bit of a user. Our method involves projecting each user's representation 
into a random, lower-dimensional space via a sparse Johnson-Lindenstrauss transform and then adding 
Gaussian noise to each entry of the lower-dimensional representation. We show that the method preserves
differential privacy, where the more privacy is desired, the larger the variance of the Gaussian noise must be.
Further, we show how to approximate the true distances between users using only the lower-dimensional,
perturbed data. Finally, we consider other perturbation methods such as randomized response and draw
comparisons to sketch-based methods. While the goal of releasing user-specific data to third parties is
broader than preserving distances, this work shows that distance computation with privacy is an
achievable goal.

1 Introduction 

In recent years, there has been an abundance of rich and fine-grained data about individuals in domains such 
as healthcare, finance, retail, web search and social networks. It is desirable for data collectors to enable third 
parties to perform complex data mining applications over such data. However, privacy is a natural obstacle 
that arises when sharing data about individuals with third parties, since the data about each individual may 
contain private and sensitive information. 

We ask the following question: Is it possible to empower third parties with knowledge about users 
without compromising privacy of the users? Suppose that party A collects private information about its 
users, where each user's data is represented as a bit vector. We focus on the setting where a third party 
B has a proprietary data mining algorithm that requires estimating the distance between users, such as 
clustering or nearest neighbors. We ask if it is possible for party A to publish some information about each 
user so that B can estimate the distance between users without being able to infer any private bit of a 
user. Even in a scenario where the data mining algorithm is run by the data collector itself (that is, parties
A and B are the same), privacy breaches are possible if the result of data mining is to be published [19]. 
The reason is that complex algorithms that access private data can be susceptible to either unintended or 
adversarial attacks [B]. While one way to address this problem is to design a sophisticated algorithm that 
respects privacy (e.g., [7]), our approach can ensure that the data given as input to the algorithm itself does 
not compromise user privacy. 

Although estimating distances between users in a privacy-preserving manner is a fundamental primitive
in many data mining applications, the approaches known to date have certain shortcomings. Approaches
that anonymize user ids while keeping the data unchanged have been shown to be badly insufficient
to preserve privacy [Udn]. A random projection based method to preserve distances between users was
proposed in [52], but later work demonstrated concrete attacks that breach privacy under this method [14, 30].
These approaches suffer from the lack of a rigorous privacy definition. On the other hand, approaches to data
sharing such as the recently proposed provably private methods for search query and click data release [16, 20]
achieve privacy at the price of giving up on all user-level information.

Contributions: We describe a simple, natural way to publish a sketch of a user that simultaneously
preserves privacy and enables estimation of the distance between users. The main idea is to project a
$d$-dimensional vector representation of a user's feature attributes into a lower $k$-dimensional space by first
applying a random Johnson-Lindenstrauss transform and then adding Gaussian noise $N(0, \sigma^2)$ to each entry
of the resulting vector. We prove that this perturbed lower-dimensional vector preserves differential privacy,
i.e., an attacker who knows all but one attribute of a user cannot recover the value of that attribute from
the published information with high confidence. In terms of utility, we show how to recover the distance
between users from the perturbed sketches. We show that the squared Euclidean distance between pairs of
users is preserved in expectation. Further, with high probability, the distance between users is preserved up
to the usual Johnson-Lindenstrauss factors plus an additive factor that depends on $k$ and the variance of
the noise $\sigma^2$.

We also compare our proposed solution to other candidate solutions. For instance, we compare to the 
more straightforward solution of directly adding noise to the user x user distance matrix. We show that 
in order to achieve the same privacy, the variance of the added noise is higher for the more direct method. 
Concretely, we show that the projection-based method is better if the maximal weight of a user vector is much 
smaller than the number of users. Also, we analyze a randomized response [32] method for data sharing. For a
fixed value of the target dimension $k$, we show that for nearby points (those within squared distance $O(\sqrt{k})$),
both algorithms are inaccurate. Projection-based methods are better when pairs are medium distance apart,
i.e., between $\sqrt{k}$ and \fdhj^. Randomized response methods excel for pairs that are far apart.

While the problem of sharing data with third parties is more complex than producing sketches of user data 
that preserve distances between users, this work offers a privacy-preserving method to enable third parties to 
execute one of the core data-mining primitives. Since the goal of our work is to enable distance computations, 
understandably it does not apply in situations and applications where proximity to a particular user is itself 
sensitive information. 

2 Preliminaries 

We first describe how the users are represented in our model and provide a formal problem statement. Then 
we discuss the measure of utility as well as the privacy definition. We also state a classic result on preserving 
distances during dimensionality reduction which is crucial for our techniques to work. 
2.1 User representation

We represent each user belonging to a set $U$ of $n$ users as a binary vector in $d$ dimensions, where each
dimension corresponds to the value of an attribute (e.g., gender, interest/disinterest in a particular topic,
location information, etc.). We assume that the attribute meanings are not sensitive or, if they are, they
can be published in a privacy-preserving manner (say, using the techniques in Goetz et al. or Korolova et
al. [16, 20]).

Our goal can be formally stated as: Given a set of user profiles represented as vectors in d dimensions, 
publish sketches of the user profiles that simultaneously preserve user privacy and enable third parties to 
estimate pairwise distance between users. 

2.2 Utility measure 

We consider Euclidean ($\ell_2$) distance between users as the distance measure we aim to preserve, as it is a
natural choice for similarity search in high dimensions [15]. We discuss other distance measures in §5. Our
measure of utility is whether pairwise Euclidean distances between users can be recovered by a third party
who has access only to the transformed privacy-preserving user profiles.

2.3 Privacy definition 

Any system that employs heuristic notions of privacy suffers from the fundamental problem that an adversary 
can come up with sophisticated attacks to breach the protection in ways that the system designer had not 
anticipated. Hence we contend that it is crucial to design a data release method with provable privacy 
guarantees. We adopt a rigorous approach to privacy introduced by Dwork et al. [12], which has gained
widespread recognition in recent years (see a survey [10]), and has been used to demonstrate the feasibility
of privacy-preserving data releases [31 [501 [131 121]. We adopt a slight variant of the definition introduced
in Dwork et al.

Definition 1 ($(\epsilon, \delta)$-differential privacy). A (randomized) algorithm $\mathcal{A}$ satisfies $(\epsilon, \delta)$-differential privacy if
for all inputs $X$ and $X'$ differing in at most one user's one attribute value, and for all sets of possible outputs
$D \subseteq \mathrm{Range}(\mathcal{A})$:

\[ \Pr[\mathcal{A}(X) \in D] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(X') \in D] + \delta, \]

where the probability is computed over the random coin tosses of the algorithm.

Intuitively, the differential privacy guarantee states that an attacker who knows all attributes of all
users except one attribute of one user cannot infer with confidence the value of that attribute from the
information published by the algorithm. The $\delta$ parameter corresponds to the probability with which the
preceding guarantee can fail, with $\delta$ typically thought of as $O(1/n)$. The privacy guarantees also extend to
small collections of (not necessarily related) attributes.

The privacy guarantees may be achieved through introduction of noise to the output. In order to achieve
the more stringent privacy guarantee of $(\epsilon, 0)$-differential privacy, the noise added typically comes from the
Laplace distribution [12]. If one is willing to tolerate a more lenient guarantee of $(\epsilon, \delta)$-differential privacy,
the noise can be drawn from the more tightly concentrated Normal distribution [11].
2.4 Johnson-Lindenstrauss transform

A celebrated result in geometry, the Johnson-Lindenstrauss Lemma [H], states that for any set $V$ of $n$
points in $\mathbb{R}^d$, given $\lambda_{JL} > 0$ and $k = \Omega(\log n / \lambda_{JL}^2)$, there exists a map that embeds the set into $\mathbb{R}^k$, distorting
all pairwise distances within at most a $1 \pm \lambda_{JL}$ factor. The proof proceeds by showing that for any $x, y \in \mathbb{R}^d$,
a linear projection $P \in \mathbb{R}^{d \times k}$ sampled from a carefully defined class satisfies

\[ (1 - \lambda_{JL}) \|x - y\|_2^2 \le \|xP - yP\|_2^2 \le (1 + \lambda_{JL}) \|x - y\|_2^2 \]

with probability $1 - \delta_{JL}$ over the choice of the projection matrix, where $\log(1/\delta_{JL}) = O(k \lambda_{JL}^2)$, and then
applying the union bound.
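The distortion guarantee can be observed numerically. The following is a minimal sketch, not part of the original analysis: it projects random binary vectors with a Gaussian matrix whose entries have variance $1/k$ and reports the worst relative distortion of squared pairwise distances. The values of `n`, `d`, `k` and the random seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 1000, 400  # illustrative sizes, not taken from the paper

# Random binary user vectors and a Gaussian projection with entry variance 1/k.
X = rng.integers(0, 2, size=(n, d)).astype(float)
P = rng.normal(0.0, np.sqrt(1.0 / k), size=(d, k))
Y = X @ P


def pairwise_sq_dists(M):
    """Matrix of squared Euclidean distances between the rows of M."""
    sq = np.sum(M * M, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (M @ M.T)


# Compare squared distances before and after projection for all pairs a < b.
iu = np.triu_indices(n, k=1)
ratios = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
max_distortion = float(np.max(np.abs(ratios - 1.0)))
print(f"largest relative distortion over all pairs: {max_distortion:.3f}")
```

Increasing `k` in this sketch shrinks the observed distortion, in line with the dependence of $k$ on $\lambda_{JL}$ in the lemma.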

This transform has become a fundamental tool in dimensionality reduction and similarity search in high
dimensions, and the computer science literature [1, 17] has proposed several constructions for $P$.

3 Construction and Usage of Privacy-Preserving Projections 

We next describe the intuition as well as the technical components of our approach. Then we state our 
privacy and utility guarantees and provide their proofs. 

3.1 Algorithms for transforming user profiles and recovering distances 

Our mechanism for enabling data sharing with privacy consists of two components: 1) an algorithm that 
transforms the representation of each user into a privacy-preserving sketch and 2) an algorithm that recovers 
distances between users from the transformed user sketches. The intuition for the design of our mechanism 
is as follows: since we aim to preserve pairwise distances with the goal of performing user segmentation 
and nearest neighbor computations, an algorithm that performs a privacy-preserving transformation of user 
profiles while approximately maintaining pairwise distances would suffice. At the core of our method is a 
one-time privacy-preserving transformation of user profiles that can be published. All subsequent operations 
can be performed on this published data and therefore do not consume a "privacy budget" or pose additional 
privacy risk. 

Our algorithms are easy to state and implement, and do not require understanding of privacy. However 
the proofs of privacy and utility guarantees are non-trivial and require deeper analysis. 

3.1.1 Private projection algorithm 

The goal of Algorithm 1 (PrivateProjection) is to transform an $n \times d$ representation of user data into
a representation that can be publicly shared without compromising the privacy of any individual involved
and can simultaneously preserve distance characteristics of the original representation. First, the data is
projected into a much lower dimension ($k \ll d$) to obtain a compact representation that preserves pairwise
distances (steps 1-2), similar to many dimensionality reduction techniques. Then the resulting data is slightly
perturbed (steps 3-4) to guarantee privacy of each user. The benefit of projecting onto a lower dimension
before doing the perturbation is that less noise needs to be added.
Algorithm 1 PrivateProjection

Input: Boolean $n \times d$ matrix $X$ whose rows correspond to people and columns correspond to attributes
learned about the users by first parties; privacy parameters $\epsilon$, $\delta$; projected dimension $k$.
Output: $d \times k$ projection matrix $P$; privacy-preserving $n \times k$ matrix $Z$, both of which can be published.

1: Construct random $d \times k$ projection matrix $P$.
2: $Y := XP$
3: Construct random $n \times k$ noise matrix $\Delta$, based on privacy parameters $\epsilon$, $\delta$ and projection matrix $P$.
4: $Z := Y + \Delta$
5: Publish $(P, Z)$.
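The five steps of the algorithm can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `private_projection` is hypothetical, the projection is assumed to be Gaussian with entry variance $1/k$, and the noise standard deviation `sigma` is passed in directly instead of being derived from the privacy parameters.

```python
import numpy as np


def private_projection(X, k, sigma, rng=None):
    """Sketch of PrivateProjection with a Gaussian projection (an assumption).

    X     : n x d array with entries in {0, 1}
    k     : projected dimension
    sigma : noise standard deviation, assumed to be chosen elsewhere so that
            the desired privacy guarantee holds
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))  # step 1: random projection
    Y = X @ P                                           # step 2: project
    Delta = rng.normal(0.0, sigma, size=(n, k))         # step 3: noise matrix
    Z = Y + Delta                                       # step 4: perturb
    return P, Z                                         # step 5: publish (P, Z)


# Toy input: three users with four binary attributes each.
X = np.array([[1, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0]])
P, Z = private_projection(X, k=2, sigma=1.0, rng=np.random.default_rng(0))
print(P.shape, Z.shape)
```

Only `P` and `Z` leave the function; the unperturbed projection `Y` and the noise `Delta` are never returned, mirroring the fact that only $(P, Z)$ is published.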



Intuitively, for a given level of desired privacy, there are two factors that affect utility and behave in 
opposite directions as we vary the projected dimension k. On the one hand, as k gets smaller, dimensionality 
reduction plays a greater role in the distortion of distances. On the other hand, as k gets larger, noise added 
plays a greater role in the distortion of distances. Finding the optimal value for the projected dimension k is 
challenging theoretically, as it depends on the underlying data distributions and the specific distance values 
we are trying to preserve. 

We next discuss the key components of Algorithm 1, namely, the choice of desired privacy guarantees
$(\epsilon, \delta)$, which determines the distribution of the noise, the choice of projection matrix $P$ (and its sensitivity),
and the corresponding choice of the parameters of the noise matrix's distribution. We remark that the
projection matrix as well as the noise matrix do not depend on the data $X$, but only require knowledge of the number
of users $n$, the original dimension $d$ and the desired privacy parameters. Following Kerckhoffs's Principle
in Cryptography, we assume that the algorithm as well as the parameters $n$, $d$, $k$, $\epsilon$, $\delta$, $P$ and the parameters
used in the noise matrix are publicly known.

3.1.2 Choosing desired privacy guarantees 

The first decision in utilizing Algorithm 1 (PrivateProjection) is to determine the privacy guarantees
desired by the algorithm's curator. The crucial observation is that one is able to guarantee $(\epsilon, \delta)$-differential
privacy by adding noise $\Delta$ from the Normal distribution, with the variance of the noise depending on the $\ell_2$-sensitivity
of the chosen projection matrix $P$, which we define next.

Definition 2 ($\ell_p$-sensitivity of $P$). Define the $\ell_p$-sensitivity of a $d \times k$ projection matrix $P = \{P_{ij}\}_{d \times k}$,
denoted by $w_p(P)$, as the maximum $\ell_p$-norm of any row in $P$, i.e., $w_p(P) = \max_{1 \le i \le d} \left( \sum_{j=1}^{k} |P_{ij}|^p \right)^{1/p}$.
Equivalently, $w_p(P)$ can be defined as $\max_{e_i} \|e_i P\|_p$, where $\{e_i\}_{i=1}^{d}$ are standard basis unit vectors.
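The definition translates directly into a row-wise norm computation. The helper below is a small sketch; the name `lp_sensitivity` is hypothetical and the example matrix is made up for illustration. It also checks the equivalent formulation via standard basis vectors.

```python
import numpy as np


def lp_sensitivity(P, p=2):
    """Maximum l_p norm over the rows of the projection matrix P."""
    P = np.asarray(P, dtype=float)
    row_norms = np.sum(np.abs(P) ** p, axis=1) ** (1.0 / p)
    return float(np.max(row_norms))


P = np.array([[3.0, -4.0],
              [1.0, 0.0]])
w2 = lp_sensitivity(P, p=2)  # row norms are 5 and 1
w1 = lp_sensitivity(P, p=1)  # row norms are 7 and 1

# Equivalent definition: e_i P selects row i, so take the largest norm of e_i P.
basis = np.eye(P.shape[0])
w2_via_basis = max(float(np.linalg.norm(e @ P, ord=2)) for e in basis)
print(w2, w1, w2_via_basis)
```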

3.1.3 Choosing projection matrix P 

There are many ways to choose a projection matrix for dimensionality reduction, depending on the properties
of the data that need to be preserved. Our choice of $P$ is guided by two considerations: (1) we would like
to preserve pairwise $\ell_2$ distances and thus user segmentation based on these distances; (2) we would like to
minimize the amount of noise to be added in order to maximize utility while guaranteeing privacy in the
subsequent step.

The natural candidate projection matrices for (1), preserving $\ell_2$ distances between vectors, are the random
projection matrices satisfying Johnson-Lindenstrauss guarantees (§2.4), such as the ones below:






1. Each entry of the matrix drawn independently from a Normal distribution with mean 0 and variance $1/k$ [17].

2. Each entry of the matrix drawn independently and uniformly at random from $\{-\frac{1}{\sqrt{k}}, +\frac{1}{\sqrt{k}}\}$ [1].

3. Each entry of the matrix is chosen independently to be $-\sqrt{3/k}$, $0$, $+\sqrt{3/k}$ with probability $\frac{1}{6}$, $\frac{2}{3}$, $\frac{1}{6}$, respectively [1].

4. The extremely sparse projection matrix of Dasgupta et al. [9].
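The first three constructions are easy to sample, which makes it possible to inspect their $\ell_2$-sensitivity empirically. The sketch below is illustrative only: the function names are hypothetical, the sizes `d` and `k` are arbitrary, the sparse construction is written with the $\pm\sqrt{3/k}$, $0$ values and probabilities $1/6$, $2/3$, $1/6$ of the standard sparse random projection, and the construction of Dasgupta et al. is omitted.

```python
import numpy as np


def gaussian_projection(d, k, rng):
    """Entries i.i.d. Normal with mean 0 and variance 1/k."""
    return rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))


def sign_projection(d, k, rng):
    """Entries uniform over {-1/sqrt(k), +1/sqrt(k)}."""
    return rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)


def sparse_projection(d, k, rng):
    """Entries -sqrt(3/k), 0, +sqrt(3/k) with probabilities 1/6, 2/3, 1/6."""
    values = np.array([-1.0, 0.0, 1.0]) * np.sqrt(3.0 / k)
    return rng.choice(values, size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])


def l2_sensitivity(P):
    """Largest Euclidean norm of a row of P."""
    return float(np.linalg.norm(P, axis=1).max())


rng = np.random.default_rng(0)
d, k = 500, 200  # illustrative sizes
sens = {
    "gaussian": l2_sensitivity(gaussian_projection(d, k, rng)),
    "sign": l2_sensitivity(sign_projection(d, k, rng)),
    "sparse": l2_sensitivity(sparse_projection(d, k, rng)),
}
print(sens)
```

For the sign matrix every row has Euclidean norm exactly 1, while for the other two the largest row norm fluctuates around 1 and approaches it as `k` grows.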

As we will see in the proof of privacy in Theorem 1, when using noise $\Delta$ from the Normal distribution,
the amount of noise needed to preserve privacy depends on the choice of the projection matrix $P$, and
more precisely, on the $\ell_2$-sensitivity of the chosen $P$. It is therefore desirable to use a projection matrix
with low $\ell_2$-sensitivity, in order to ensure that we are adding the smallest possible amount of noise and
therefore maximizing utility while preserving privacy. The expected $\ell_2$-sensitivity of all of the random
projection matrices described above is tightly concentrated around 1 (using the alternative definition
$w_2(P) = \max_{e_i} \|e_i P\|_2$, where $\{e_i\}_{i=1}^{d}$ are standard basis unit vectors, and applying the proofs of low
distortion for these matrices), and therefore all of them are suitable for privacy-preserving transformations
that aim to preserve the maximum utility.

We emphasize that the specific measure of sensitivity of the matrix $P$, namely $\ell_2$-sensitivity, is driven by
the type of noise added for privacy, which is Normal in our case, and not by the choice of norm one seeks to
preserve under projection.

3.1.4 The random noise matrix A 

The choice of the desired privacy guarantees and projection matrix $P$ determines the noise matrix $\Delta$. Each
entry in $\Delta$ is drawn randomly and independently from the Normal distribution with mean 0 and variance
$\sigma^2$, where the variance of the noise depends on the $\ell_2$-sensitivity of the projection matrix $P$ and the privacy
parameters $\epsilon$ and $\delta$. By choosing $\sigma$ satisfying the condition in Theorem 1, the algorithm guarantees
$(\epsilon, \delta)$-differential privacy.

3.1.5 Recover distance algorithm 

We next describe our algorithm for estimating the squared distance between two users, given their sketches
released in a privacy-preserving manner using Algorithm 1. Algorithm 2 (RecoverDistancePP) computes
the squared $\ell_2$ distance between the transformed representations in the $k$-dimensional space, and then
discounts for the systematic positive distortion of the distance due to noise addition. The discount $2k\sigma^2$
represents the expected distortion in the squared distance due to Gaussian noise addition.

By repeated application of Algorithm 2, a third party can perform user segmentation and study the
characteristics of the segments, as well as perform nearest-neighbor computations.



Algorithm 2 RecoverDistancePP

Input: $n \times k$ matrix $Z$ published in a privacy-preserving manner; noise parameter $\sigma$; indices $a$, $b$ of the
desired users.
Output: Estimated squared distance between users $a$ and $b$ in the original space.

1: Let $x$ and $y$ be the $a$th and $b$th rows in $Z$, respectively.
2: Output $\mathrm{dist}^2_{PP}(\mathrm{user}_a, \mathrm{user}_b) = \|x - y\|_2^2 - 2k\sigma^2$.
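The recovery step amounts to a squared norm followed by a constant correction. A minimal sketch follows; the function name `recover_distance_pp` is hypothetical and the two-row matrix is a made-up example rather than actual published output.

```python
import numpy as np


def recover_distance_pp(Z, sigma, a, b):
    """Estimate the squared distance between users a and b from published sketches."""
    Z = np.asarray(Z, dtype=float)
    k = Z.shape[1]
    diff = Z[a] - Z[b]
    # Squared distance in the sketch space, minus the expected contribution
    # 2 * k * sigma^2 of the two independent noise vectors.
    return float(diff @ diff - 2.0 * k * sigma ** 2)


Z = np.array([[0.0, 0.0],
              [3.0, 4.0]])
estimate = recover_distance_pp(Z, sigma=1.0, a=0, b=1)
print(estimate)  # 25 - 2 * 2 * 1 = 21
```

Note that the estimate can be negative when the true distance is small relative to the noise, since the correction term is subtracted regardless of the observed value.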
3.2 Formal privacy and utility guarantees



We now prove formal privacy and utility guarantees for the blueprints of Algorithm 1 and Algorithm 2 for
the case when the noise $\Delta$ is drawn from the Normal distribution and the utility goal is to preserve $\ell_2$ distance
between users.

3.2.1 Privacy guarantees 

As Algorithm 2 uses already published data, it is sufficient to provide privacy guarantees for Algorithm 1.

Theorem 1. Let $w_2(P)$ be the $\ell_2$-sensitivity of the projection matrix $P$ (see Definition 2). Assuming $\delta < \frac{1}{2}$,
let the entries of the noise matrix $\Delta$ be drawn from $N(0, \sigma^2)$ with $\sigma \ge w_2(P) \frac{\sqrt{2(\ln(1/(2\delta)) + \epsilon)}}{\epsilon}$. Then Algorithm 1
satisfies $(\epsilon, \delta)$-differential privacy with respect to a change in an individual person's attribute.
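To see how the noise level scales with the privacy parameters, the bound of the theorem can be evaluated directly. The sketch below assumes the bound reads $\sigma \ge w_2(P)\sqrt{2(\ln(1/(2\delta)) + \epsilon)}/\epsilon$ (the form consistent with the proof of Lemma 1 and with Lemma 2); the function name `noise_std` and the sample parameter values are illustrative assumptions.

```python
import math


def noise_std(w2, eps, delta):
    """Smallest sigma allowed by the assumed form of the Theorem 1 bound."""
    if eps <= 0.0 or not (0.0 < delta < 0.5):
        raise ValueError("requires eps > 0 and 0 < delta < 1/2")
    return w2 * math.sqrt(2.0 * (math.log(1.0 / (2.0 * delta)) + eps)) / eps


# The required noise grows as eps shrinks and is linear in the sensitivity w2.
for eps in (2.0, 1.0, 0.5):
    print(eps, round(noise_std(1.0, eps, delta=0.1), 4))
```

The bound has no explicit dependence on $n$, $d$ or $k$; these enter only through the sensitivity $w_2(P)$.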

A surprising feature of the algorithm, and one that will turn out to be crucial for its utility,
is that the amount of noise one needs to add in order to satisfy privacy guarantees does not depend on the
dimensions of the projection matrix $P$ other than through a (possible) dependence of the sensitivity $w_2(P)$ on
$P$'s dimension. The work of McSherry and Mironov [24] uses a similar observation relating multi-dimensional
Gaussian noise and privacy guarantees without detailing the proof, so we provide the proof for completeness.

We first prove a more general geometric statement, which we will then use to prove the privacy guarantees
of our algorithm. The lemma extends the result of Dwork et al. to multiple dimensions.

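Lemma 1. Let $Y$ and $Y'$ be points in $\mathbb{R}^k$ s.t. $\|Y - Y'\|_2 \le w$. Then for any $D \subseteq \mathbb{R}^k$, and any $\Delta$ drawn
from $N^k(0, \sigma^2)$, where $\sigma \ge w \frac{\sqrt{2(\ln(1/(2\delta)) + \epsilon)}}{\epsilon}$ and $\delta < \frac{1}{2}$, the following inequality holds:
$\Pr[Y' + \Delta \in D] \le e^{\epsilon} \Pr[Y + \Delta \in D] + \delta$.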

Proof. The crucial insight is that due to spherical symmetry properties of Gaussian noise, we may choose
the basis in such a way that $Y$ and $Y'$ differ in exactly one dimension.

Partition $D$ into two sets of points: $D_{in} = \{z \in D : \langle Y' - Y, z - Y' \rangle \le wR\}$ and
$D_{out} = \{z \in D : \langle Y' - Y, z - Y' \rangle > wR\}$. The value of $R$ will be determined later. We first prove that

\[ \Pr[Y' + \Delta \in D_{in}] \le e^{\epsilon} \Pr[Y + \Delta \in D_{in}], \quad \text{if } R \le \frac{2\epsilon\sigma^2 - w^2}{2w}, \qquad (1) \]

and then prove that

\[ \Pr[Y' + \Delta \in D_{out}] \le \delta, \quad \text{if } R \ge \sigma\sqrt{2\ln(1/(2\delta))}. \qquad (2) \]

By choosing $\sigma$ so that $R$ satisfies both constraints of (1) and (2), summing the resulting inequalities, and
observing that $\Pr[Y + \Delta \in D_{in}] \le \Pr[Y + \Delta \in D]$, we will obtain the desired bound.

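Proof of (1). By assumption $R \le \frac{2\epsilon\sigma^2 - w^2}{2w}$. By the definition of the Gaussian noise,
the density function restricted to $D_{in}$ satisfies

\[ \exp\left(-\frac{1}{2\sigma^2}\|Y' - z\|_2^2\right) = \exp\left(-\frac{1}{2\sigma^2}\left(\|Y - z\|_2^2 - \|Y - Y'\|_2^2 - 2\langle Y' - Y, z - Y' \rangle\right)\right) \]
\[ = \exp\left(-\frac{1}{2\sigma^2}\|Y - z\|_2^2\right) \cdot \exp\left(\frac{1}{2\sigma^2}\left(\|Y - Y'\|_2^2 + 2\langle Y' - Y, z - Y' \rangle\right)\right) \]
\[ \le \exp\left(-\frac{1}{2\sigma^2}\|Y - z\|_2^2\right) \cdot \exp\left(\frac{w^2 + 2wR}{2\sigma^2}\right) \]
\[ \le \exp\left(-\frac{1}{2\sigma^2}\|Y - z\|_2^2\right) \cdot \exp(\epsilon). \]

It implies that

\[ \Pr[Y' + \Delta \in D_{in}] = \frac{1}{(\sqrt{2\pi}\sigma)^k} \int_{D_{in}} \exp\left(-\frac{1}{2\sigma^2}\|Y' - z\|_2^2\right) dz \]
\[ \le \frac{1}{(\sqrt{2\pi}\sigma)^k} \int_{D_{in}} \exp\left(-\frac{1}{2\sigma^2}\|Y - z\|_2^2\right) \exp(\epsilon)\, dz \le \exp(\epsilon) \Pr[Y + \Delta \in D_{in}]. \]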



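Proof of (2). Recall that $R \ge \sigma\sqrt{2\ln(1/(2\delta))}$. We choose the coordinate system so that $Y = (y_1, \ldots, y_k)$ and
$Y' = (y'_1, \ldots, y'_k)$ differ only in the first coordinate and $y'_1 < y_1$. Then

\[ D_{out} = \{z \in D : \langle Y' - Y, z - Y' \rangle > wR\} \subseteq \{z \in \mathbb{R}^k : (y'_1 - y_1)(z_1 - y'_1) > wR\}, \]

which implies the following bound on the probability of $Y' + \Delta$ falling inside $D_{out}$:

\[ \Pr[Y' + \Delta \in D_{out}] = \frac{1}{(\sqrt{2\pi}\sigma)^k} \int_{D_{out}} \exp\left(-\frac{1}{2\sigma^2}\|Y' - z\|_2^2\right) dz_1 \ldots dz_k \]
\[ \le \frac{1}{(\sqrt{2\pi}\sigma)^k} \int_{(y'_1 - y_1)(z_1 - y'_1) > wR} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{k}(z_i - y'_i)^2\right) dz_1 \ldots dz_k \]
\[ \le \frac{1}{\sqrt{2\pi}\sigma} \int_{R}^{\infty} \exp\left(-\frac{t^2}{2\sigma^2}\right) dt = \frac{1}{2}\left(1 - \mathrm{erf}\left(\frac{R}{\sigma\sqrt{2}}\right)\right) \le \frac{1}{2}\exp\left(-\frac{R^2}{2\sigma^2}\right) \le \delta \qquad (*) \]

if $R \ge \sigma\sqrt{2\ln(1/(2\delta))}$. The third inequality uses $|y'_1 - y_1| \le w$, so that the integration region is contained in
$\{z : |z_1 - y'_1| > R\}$ on one side of $y'_1$. The bound (*) follows from $1 - \mathrm{erf}(x) \le \exp(-x^2)$ for $x \ge 0$ [S].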
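Hence, for (1) and (2) to hold simultaneously, we need

\[ \sigma\sqrt{2\ln(1/(2\delta))} \le R \le \frac{2\epsilon\sigma^2 - w^2}{2w} \quad \text{and} \quad R \ge 0. \]

By solving the resulting quadratic inequality we conclude that Lemma 1 holds if
$\sigma \ge w \frac{\sqrt{2\ln(1/(2\delta))} + \sqrt{2(\ln(1/(2\delta)) + \epsilon)}}{2\epsilon}$
and $\delta < \frac{1}{2}$. The claim follows by observing that $2\sqrt{2(\ln(1/(2\delta)) + \epsilon)} \ge \sqrt{2\ln(1/(2\delta))} + \sqrt{2(\ln(1/(2\delta)) + \epsilon)}$. □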
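As a sanity check of the lemma in one dimension, one can evaluate both sides of the inequality on half-lines $D = [t, \infty)$, for which Gaussian probabilities have a closed form. The sketch below is an illustration under stated assumptions, not a proof: it takes $Y = 0$ and $Y' = w$, uses the bound $\sigma = w\sqrt{2(\ln(1/(2\delta)) + \epsilon)}/\epsilon$, and scans a grid of thresholds $t$; the parameter values and the grid are arbitrary.

```python
import math


def gauss_tail(t, mean, sigma):
    """Pr[N(mean, sigma^2) >= t], computed via the complementary error function."""
    return 0.5 * math.erfc((t - mean) / (sigma * math.sqrt(2.0)))


eps, delta, w = 1.0, 0.1, 1.0  # illustrative privacy parameters and shift
sigma = w * math.sqrt(2.0 * (math.log(1.0 / (2.0 * delta)) + eps)) / eps

# Y = 0 and Y' = w; D ranges over half-lines [t, infinity) on a grid of t values.
thresholds = [i * 0.01 for i in range(-2000, 2001)]
worst_gap = max(
    gauss_tail(t, w, sigma) - math.exp(eps) * gauss_tail(t, 0.0, sigma)
    for t in thresholds
)
print(f"sigma = {sigma:.4f}, largest gap over the grid = {worst_gap:.5f}")
```

The largest gap $\Pr[Y' + \Delta \in D] - e^{\epsilon}\Pr[Y + \Delta \in D]$ found on the grid stays below $\delta$, as the lemma requires for every measurable set.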

Proof of Theorem 1. The intuition behind the proof is to observe that a one-element difference in matrices
$X$ and $X'$ will affect only one row of the projection.

To prove that Algorithm 1 satisfies $(\epsilon, \delta)$-differential privacy, we need to prove that for any two input
matrices $X$ and $X'$ which differ in one element $X_{aj}$ (corresponding to user $a$ having value 1 or 0 for attribute
$j$), and for any $D$, where $D$ is a set of possible outputs of the algorithm, namely a set of $n \times k$ matrices, the
following inequality holds over the random choices of the algorithm:

\[ \Pr[X'P + \Delta \in D] \le e^{\epsilon} \Pr[XP + \Delta \in D] + \delta, \]

where $\Delta$ is an $n \times k$ noise matrix, in which each element is drawn independently at random from $N(0, \sigma^2)$.
Fix $X$ and $X'$, and recall the notation of Algorithm 1. Wlog view $Y$ and $Y'$ (in a natural way) as flattened
vectors of length $nk$ rather than $n \times k$ matrices. Observe that if $X$ and $X'$ are binary and $\|X' - X\|_2 = 1$, then

\[ \|Y' - Y\|_2 = \|X'P - XP\|_2 = \|(X' - X) \cdot P\|_2 \le \max_{1 \le i \le d} \sqrt{\sum_{j=1}^{k} P_{ij}^2} = w_2(P). \]

Applying the result of Lemma 1 to $Y$ and $Y'$, we obtain the desired privacy guarantee. □

We remark that Theorem 1 applies even if the input matrix $X$ consists of values in $[0, 1]$ instead of
Boolean values.

In Figure 1 we depict the exact relationship between the privacy parameters $\epsilon$ and $\delta$, and the variance
of the noise needed, by plotting three curves of feasible $(\epsilon, \delta)$ pairs for three choices of $\sigma$. The chart can be
used either to determine legitimate values of $\epsilon$ and $\delta$ for a fixed $\sigma$, or vice versa. Fixing the value of $\sigma$ to 1.0
implies $(\epsilon, \delta)$-privacy for all values of $\epsilon$, $\delta$ on the middle curve. Alternatively, one can fix the values $(\epsilon, \delta)$ to
$(1, 0.1)$ and find a noise level $\sigma \approx 1.0$ whose curve passes through the point.



Figure 1: Feasible values of $\epsilon$, $\delta$ for a given choice of $\sigma$ (curves for $\sigma = 0.5$, $\sigma = 1.0$ and $\sigma = 2.0$; horizontal axis: $\epsilon$).



3.2.2 Utility guarantees for the Gaussian projection 

We next discuss the utility guarantees provided by our algorithms. We show that the squared Euclidean 
distance between two user vectors is preserved in expectation after the privacy transformations performed 
by our algorithms, and further provide guarantees on how far the distance after transformation can deviate 
from the original distance. From a third party's perspective, these guarantees imply that (a) the users who 
are close in the original space are likely to remain close in the transformed space and (b) similarly the users 
who are far apart are likely to remain so after the transformations. 

Concrete utility guarantees depend on the type of the projection matrix $P$. Among the possible choices
for projection matrices described in Section 3.1.3, we analyze the guarantees afforded by the use of the
Gaussian projection matrix due to Indyk and Motwani [17], proving that the resulting estimate of the
squared Euclidean distance is unbiased, computing its variance and giving a tail probability bound.

Although Algorithm 1 is stated as applied to $n$ user vectors simultaneously, we will analyze its utility in
preserving squared distances between a particular fixed pair of users. Consider two user vectors $x, y \in \{0, 1\}^d$
which are transformed by Algorithm 1 into $\tilde{x} = xP + \Delta_1$ and $\tilde{y} = yP + \Delta_2$, where $\Delta_1$ and $\Delta_2$ are independent
$k$-dimensional Gaussians $N^k(0, \sigma^2)$.
Recall that according to Theorem 1, in order for Algorithm 1 to satisfy $(\epsilon, \delta)$-differential privacy, $\sigma$ is
determined as a function of $\epsilon$, $\delta$, and $w_2(P)$, which in turn depends on the projection matrix $P$. The
following lemma bounds $\sigma$ for a given setting of $\epsilon$, $\delta$ and $k$.

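Lemma 2. Let the projection matrix $P$ be a $d \times k$ matrix whose entries are i.i.d. $N(0, 1/k)$ random variables.
Algorithm 1 using the noise matrix whose entries are sampled from $N(0, \sigma^2)$ satisfies $(\epsilon, \delta)$-differential privacy if

\[ \sigma \ge \frac{4}{\epsilon}\sqrt{\ln(1/\delta)}, \]
\[ k \ge 2(\ln d + \ln(2/\delta)), \]
\[ \epsilon < \ln(1/\delta). \]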

Proof. According to Theorem 1, $(\epsilon, \delta/2)$-differential privacy is satisfied if

\[ \sigma \ge w_2(P) \frac{\sqrt{2(\ln(1/\delta) + \epsilon)}}{\epsilon}, \qquad (3) \]



where $w_2(P)$ is $P$'s $\ell_2$-sensitivity. Since the entries of $P$ are distributed as Gaussians, its sensitivity $w_2(P)$
has the following distribution:

\[ w_2(P) \sim \sqrt{\max_{1 \le i \le d} \frac{1}{k} Z_i}, \]

where the $Z_i$'s are i.i.d. $\chi^2_k$ variables (i.e., distributed according to the chi-squared distribution with $k$ degrees
of freedom). Choosing $x = \ln d + \ln(2/\delta)$ and applying Lemma 4 (see Appendix), we find that

\[ \Pr\left[ w_2(P)^2 > 1 + 2\sqrt{x/k} + 2x/k \right] < \delta/2. \]



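Under the assumption that $k \ge 2(\ln d + \ln(2/\delta))$, the probability that $w_2(P)$ is greater than 2 is less than $\delta/2$.
Combining this bound with (3), we find that Algorithm 1 satisfies $(\epsilon, \delta)$-differential privacy for $\epsilon < \ln(1/\delta)$ if

\[ \sigma \ge \frac{4}{\epsilon}\sqrt{\ln(1/\delta)} \ge \frac{2}{\epsilon}\sqrt{2(\ln(1/\delta) + \epsilon)} \quad \text{and} \quad k \ge 2(\ln d + \ln(2/\delta)), \]

which completes the proof. □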
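The conditions of Lemma 2 are simple enough to evaluate for concrete parameter settings. The helper below is a sketch under the assumption that the noise condition reads $\sigma \ge \frac{4}{\epsilon}\sqrt{\ln(1/\delta)}$, as in the final display of the proof; the function name `lemma2_parameters` and the sample values of `d`, `eps`, `delta` are hypothetical.

```python
import math


def lemma2_parameters(d, eps, delta):
    """Return (sigma, k_min) satisfying the conditions of Lemma 2.

    sigma is the smallest noise standard deviation allowed by the lemma and
    k_min the smallest integer projected dimension it permits.
    """
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    if not (0.0 < eps < math.log(1.0 / delta)):
        raise ValueError("Lemma 2 requires 0 < eps < ln(1/delta)")
    sigma = (4.0 / eps) * math.sqrt(math.log(1.0 / delta))
    k_min = math.ceil(2.0 * (math.log(d) + math.log(2.0 / delta)))
    return sigma, k_min


sigma, k_min = lemma2_parameters(d=1000, eps=1.0, delta=0.01)
print(f"sigma >= {sigma:.3f}, k >= {k_min}")

# The chosen sigma dominates the Theorem 1 bound evaluated at sensitivity 2
# and failure probability delta / 2, which is the step used in the proof.
theorem1_bound = 2.0 * math.sqrt(2.0 * (math.log(1.0 / 0.01) + 1.0)) / 1.0
```

Because `sigma` here depends only on `eps` and `delta`, it can be fixed before the projection matrix is sampled.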



The proof of Lemma 2 implies that the value of $\sigma$ can be chosen independently of $P$. This property is
crucial for the following argument, which repeatedly uses independence of the matrix $P$ and the noise $\Delta$
(scaled by $\sigma$).

Theorem 2. Algorithms 1 and 2, where entries of $P$ are sampled from $N(0, 1/k)$, and $\sigma$ is chosen
independently of the realization of $P$, satisfy the following utility guarantees:
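1. $\mathrm{dist}^2_{PP}$ is an unbiased estimator of $\|x - y\|_2^2$:

\[ E[\mathrm{dist}^2_{PP}(x, y)] = \|x - y\|_2^2. \]

2. Variance of $\mathrm{dist}^2_{PP}$ is given by the following expression:

\[ \mathrm{Var}[\mathrm{dist}^2_{PP}(x, y)] = 2\|x - y\|_2^4 / k + 8\sigma^2 \|x - y\|_2^2 + 8\sigma^4 k. \]

3. Deviations are bounded, i.e., with probability $1 - (\delta_{JL} + \delta_{\chi^2} + \delta_N)$, the following holds:

\[ |\mathrm{dist}^2_{PP}(x, y) - \|x - y\|_2^2| \le \lambda_{JL} \|x - y\|_2^2 + 4\sigma^2 \sqrt{k} \lambda_{\chi^2} + 4\sigma^2 \lambda_{\chi^2}^2 + 4\sigma (1 + \lambda_{JL}) \lambda_N \|x - y\|_2, \qquad (4) \]

when $\lambda_{JL} < 1/2$, $\delta_{JL} \ge 2\exp(-k\lambda_{JL}^2/6)$, $\delta_{\chi^2} \ge 2\exp(-\lambda_{\chi^2}^2)$ and $\delta_N \ge$ ^^j^^.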
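The first two guarantees can be checked by simulation. The following Monte Carlo sketch is illustrative only: it assumes the variance formula $2\|x - y\|_2^4/k + 8\sigma^2\|x - y\|_2^2 + 8\sigma^4 k$, draws a fresh Gaussian projection and fresh noise in every trial, and compares the empirical mean and variance of the estimator with the stated values. The vectors, dimensions, noise level, seed and number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, sigma, trials = 30, 10, 0.5, 20000  # illustrative settings

# Two binary user vectors that differ in exactly 10 coordinates.
x = np.zeros(d)
x[:10] = 1.0
y = np.zeros(d)
y[5:15] = 1.0
r2 = float(np.sum((x - y) ** 2))

# One independent projection matrix and one pair of noise vectors per trial.
P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(trials, d, k))
x_tilde = np.einsum("d,tdk->tk", x, P) + rng.normal(0.0, sigma, size=(trials, k))
y_tilde = np.einsum("d,tdk->tk", y, P) + rng.normal(0.0, sigma, size=(trials, k))

# Estimator of Algorithm 2 applied to each simulated pair of sketches.
est = np.sum((x_tilde - y_tilde) ** 2, axis=1) - 2.0 * k * sigma ** 2

predicted_var = 2.0 * r2 ** 2 / k + 8.0 * sigma ** 2 * r2 + 8.0 * sigma ** 4 * k
print(f"true squared distance {r2}, empirical mean {est.mean():.3f}")
print(f"predicted variance {predicted_var}, empirical variance {est.var():.3f}")
```

The empirical mean and variance should agree with the predicted values up to sampling error, which shrinks as `trials` grows.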

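Proof. First we note that $\Delta = \Delta_1 - \Delta_2$ is distributed as a $k$-dimensional Gaussian $N^k(0, 2\sigma^2)$. We can
express the random variable $\mathrm{dist}^2_{PP}(x, y)$ as a sum of three random variables $Z_1$, $Z_2$, $Z_3$:

\[ \mathrm{dist}^2_{PP}(x, y) = \|\tilde{x} - \tilde{y}\|_2^2 - 2k\sigma^2 = \|xP + \Delta_1 - yP - \Delta_2\|_2^2 - 2k\sigma^2 \]
\[ = \|(x - y)P + \Delta\|_2^2 - 2k\sigma^2 = \underbrace{\|(x - y)P\|_2^2}_{Z_1} + \underbrace{2\langle (x - y)P, \Delta \rangle}_{Z_2} + \underbrace{\|\Delta\|_2^2 - 2k\sigma^2}_{Z_3}. \]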

Z\ Z2 

For the given fixed choice of user vectors $x$ and $y$, let $z = x - y = (z_1, \ldots, z_d)$ and $r = \|x - y\|_2$. Since the
entries of $P$ are i.i.d. according to $N(0, 1/k)$, the projection $(x - y)P$ is distributed according to $N^k(0, r^2/k)$.
Indeed, the $i$th entry of $(x - y)P$ has the following distribution:

\[ \sum_{j=1}^{d} z_j P_{ji} \sim \sum_{j=1}^{d} z_j N(0, 1/k) \sim N\left(0, \sum_{j=1}^{d} z_j^2 / k\right) = N(0, r^2/k). \]

Using the above expression and Lemma 6 we may write the variables $Z_1$, $Z_2$, $Z_3$ as follows:

\[ Z_1 \sim \|N^k(0, \|x - y\|_2^2 / k)\|_2^2 \sim r^2 \cdot \chi^2_k / k, \]
\[ Z_2 \sim N(0, 8\sigma^2 Z_1), \]
\[ Z_3 \sim 2\sigma^2 \chi^2_k - 2k\sigma^2, \]

where $\chi^2_k$ is the chi-squared distribution with $k$ degrees of freedom, defined as the distribution of a sum of
the squares of $k$ independent $N(0, 1)$ random variables.

Claim 1. To show that $\mathrm{dist}^2_{PP}(x, y)$ is an unbiased estimator for $r^2 = \|x - y\|_2^2$, observe that the mean of
the chi-squared distribution with $k$ degrees of freedom is $k$. Therefore,

\[ E[Z_1] = r^2, \]
\[ E[Z_2] = 0, \]
\[ E[Z_3] = 0, \]

and thus

\[ E[\mathrm{dist}^2_{PP}(x, y)] = E[Z_1] + E[Z_2] + E[Z_3] = r^2. \]






Claim 2. To compute the variance of $\mathrm{dist}^2_{PP}(x, y)$, express it as

\[ \mathrm{Var}(\mathrm{dist}^2_{PP}(x, y)) = \mathrm{Var}(Z_1 + Z_2 + Z_3) = E[(Z_1 + Z_2 + Z_3)^2] - (E[Z_1 + Z_2 + Z_3])^2 \]
\[ = E[Z_1^2 + Z_2^2 + Z_3^2 + 2Z_1 Z_2 + 2Z_1 Z_3 + 2Z_2 Z_3] - (E[Z_1] + E[Z_2] + E[Z_3])^2. \qquad (5) \]

Recall that by assumption of the theorem, $\sigma$ is chosen independently of $P$; therefore, $(x - y)P$ and $\Delta$ are
independent. The expectations of the pairwise products can be evaluated as follows:

\[ E[Z_1 Z_2] = E[\|(x - y)P\|_2^2 \cdot 2\langle (x - y)P, \Delta \rangle] = E[2\langle \|(x - y)P\|_2^2 \cdot (x - y)P, \Delta \rangle] = 0, \quad \text{(by Lemma H])} \]
\[ E[Z_1 Z_3] = E[Z_1] E[Z_3] = 0, \quad \text{(since $Z_1$ and $Z_3$ are independent)} \]
\[ E[Z_2 Z_3] = E[2\langle (x - y)P, \Delta \rangle \cdot (\|\Delta\|_2^2 - 2k\sigma^2)] = E[2\langle (x - y)P, \Delta \cdot (\|\Delta\|_2^2 - 2k\sigma^2) \rangle] = 0. \quad \text{(by Lemmai])} \]

Analyzing the other terms in equation (5), we have

\[ E[Z_1^2] - E[Z_1]^2 = \mathrm{Var}(Z_1) = \mathrm{Var}(r^2 \cdot \chi^2_k / k) = \frac{r^4}{k^2} \mathrm{Var}(\chi^2_k) = \frac{r^4}{k^2} \cdot 2k = \frac{2r^4}{k}, \]
\[ E[Z_3^2] - E[Z_3]^2 = \mathrm{Var}(Z_3) = \mathrm{Var}(2\sigma^2 \chi^2_k - 2k\sigma^2) = 4\sigma^4 \mathrm{Var}(\chi^2_k) = 8\sigma^4 k, \]

since $\mathrm{Var}(\chi^2_k) = 2k$.

To finish the computation, we need to evaluate $E[Z_2^2]$. Recall that $Z_2 = 2\langle (x - y)P, \Delta \rangle$, where $(x - y)P \sim
N^k(0, r^2/k)$ and $\Delta \sim N^k(0, 2\sigma^2)$. Since $E[Z_2] = 0$, the second moment of $Z_2$ is $\mathrm{Var}(Z_2)$, which can be
computed as follows:

\[ \mathrm{Var}(Z_2) = \mathrm{Var}\left(2 \sum_{i=1}^{k} N(0, r^2/k) \cdot N(0, 2\sigma^2)\right) = k \mathrm{Var}\left(2 N(0, r^2/k) \cdot N(0, 2\sigma^2)\right) = 8 r^2 \sigma^2 \]

(the last equation holds because the mean of both Gaussians is zero, in which case the variance of the product
of two independent variables is the product of their variances).

Putting the above expressions together into equation (5) we obtain

\[ \mathrm{Var}(\mathrm{dist}^2_{PP}(x, y)) = 2r^4/k + 8\sigma^2 r^2 + 8\sigma^4 k, \]

as claimed.

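Claim 3. Towards proving deviation bounds, observe that

$|Z_1 - r^2| \le \lambda_{JL} r^2$ with probability at least $1 - \delta_{JL}$ (by [2, Lemma 4.1]),

$|Z_2| \le 4\sigma \lambda_N \sqrt{Z_1}$ with probability at least $1 - \delta_N$ (Lemma 5 in Appendix),

$|Z_3| \le 4\sigma^2 \sqrt{k} \lambda_{\chi^2} + 4\sigma^2 \lambda_{\chi^2}^2$ with probability at least $1 - \delta_{\chi^2}$ (by [HU Lemma 1]).

Using the union bound and plugging the bound on $Z_1$ into the second expression, we obtain the desired
bound.

By [2, Lemma 4.1], $|Z_1 - r^2| \le \lambda_{JL} r^2$ with probability at least

\[ 1 - 2\exp\left(-\frac{k}{2}\left(\lambda_{JL}^2/2 - \lambda_{JL}^3/3\right)\right) \ge 1 - 2\exp(-k\lambda_{JL}^2/6) \ge 1 - \delta_{JL} \]

if $\lambda_{JL} < 1/2$ and $\delta_{JL} \ge 2\exp(-k\lambda_{JL}^2/6)$. □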
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Optimal projection dimension: A natural question that our analysis leaves open is how to find an 
optimum number of dimensions k to which we should project. 

• To find the asymptotics of the optimal target dimension $k$ for a fixed setting of the noise $\sigma$ and the failure
probability $\delta_{JL} + \delta_{\chi^2} + \delta_N$, we equate the failure probabilities $\mu = \delta_{JL} = \delta_{\chi^2} = \delta_N$ for some fixed $\mu < 1/3$.
From the conditions on the $\lambda$'s in the statement of Theorem 2 it follows that $\lambda_{\chi^2} = \Theta(\sqrt{\log 1/\mu})$,
$\lambda_N = \Theta(\sqrt{\log 1/\mu})$, and $\lambda_{JL} = \Theta(\sqrt{\log(1/\mu)/k})$. Optimizing the resulting upper bound for $k$ we obtain that
$k_{\mathrm{OPT}} = \Theta(\|x - y\|_2^2/\sigma^2)$.

• Another approach for finding the optimal target dimension $k$ for a fixed setting of the noise $\sigma$ would
be to aim to minimize the variance of the squared distance estimate returned by the algorithm, which
happens for $k_{\mathrm{OPT}} = \|x - y\|_2^2/(2\sigma^2)$.

Both of these analytic approaches imply that the optimal value for the target dimension of the privacy-
preserving Johnson-Lindenstrauss transform depends on the expected distance between vectors measured
using this mechanism, and it scales inversely proportionally to $\sigma^2 = \Theta(\ln(1/\delta)/\epsilon^2)$ (Lemma 2). For this
choice of the parameters, the (additive) error in measuring $\|x - y\|_2^2$ is $O(\sigma \cdot \sqrt{\log 1/\mu} \cdot \|x - y\|_2)$ and holds
with probability $1 - \mu$, assuming that $\log 1/\mu \le k$. The variance of the estimator when $k \approx \|x - y\|_2^2/(2\sigma^2)$
is $16\sigma^2 \cdot \|x - y\|_2^2$. An algorithm designer applying this algorithm in practice could consider using different
projection matrices with varying $k$'s, each optimized for a particular range of distances, and would need a
logarithmic (in terms of possible distances) number of such projections.
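
The variance-minimizing choice of the target dimension, and the idea of covering a range of distances with a logarithmic number of projections, can be sketched as follows. This is an illustrative sketch only: `variance_optimal_k` follows the formula $k = \|x - y\|_2^2/(2\sigma^2)$ stated above, while `dimension_grid` is a hypothetical helper (one target dimension per doubling of the squared distance) and not a procedure specified in the paper.

```python
def pp_variance(r2, k, sigma):
    """Variance of the PrivateProjection estimator, with r2 = ||x - y||_2^2."""
    return 2 * r2 ** 2 / k + 8 * sigma ** 2 * r2 + 8 * sigma ** 4 * k


def variance_optimal_k(r2, sigma):
    """Minimizer of pp_variance over k: setting d/dk to zero gives k = r2 / (2 sigma^2)."""
    return r2 / (2 * sigma ** 2)


def dimension_grid(min_r2, max_r2, sigma):
    """Hypothetical helper: one target dimension per doubling of the squared distance."""
    ks = []
    r2 = min_r2
    while r2 <= max_r2:
        ks.append(max(1, round(variance_optimal_k(r2, sigma))))
        r2 *= 2
    return ks


# Illustrative values (assumed): squared distance 200, sigma = 2.
k_opt = variance_optimal_k(200, 2)
print(k_opt, pp_variance(200, k_opt, 2), 16 * 2 ** 2 * 200)
print(dimension_grid(8, 64, 1))
```

At the optimum the three terms of the variance become $4\sigma^2 r^2$, $8\sigma^2 r^2$ and $4\sigma^2 r^2$, which is where the value $16\sigma^2 \|x - y\|_2^2$ comes from.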

4 Alternative Approaches 

In this section we consider alternative approaches to release of pairwise distances of $n$ vectors in $\{0,1\}^d$. The
first approach is based on output perturbation, where the noise is added directly to the final outcome of the 
mechanism, i.e., the n x n matrix of all pairwise distances. We argue that this method is inferior to privacy- 
preserving projections (previous section) in most settings. The other method is based on input perturbation, 
where the noise is added to the raw d-dimensional vectors. We compare the method to privacy-preserving 
projections, and discuss their ranges of applicability. 

4.1 Direct Noise Addition 

A classic result in differential privacy [11] shows that any function can be computed with $(\epsilon, \delta)$-differential
privacy as long as Gaussian noise calibrated according to the $\ell_2$-sensitivity of that function is added
to the true function value prior to its announcement (Lemma 1). Thus, a natural alternative approach to
the Johnson-Lindenstrauss transform-based algorithm that we proposed is an algorithm publishing noisy
versions of pairwise distances between points by adding properly calibrated noise to the true distances. We
formalize this approach in Algorithms 3 and 4.

Algorithm 3 NoiseAddition

Input: Boolean $n \times d$ matrix $X$ whose rows correspond to people and columns correspond to attributes
learned about the users by first parties; Privacy parameters $\epsilon, \delta$.

Output: Privacy-preserving strictly upper triangular $n \times n$ matrix $Z$, whose entries correspond to noisy
pairwise distances squared, which can be published.

1: Construct random $n \times n$ strictly upper triangular noise matrix $\Delta$, based on privacy parameters $\epsilon, \delta$.

2: Let $Y$ be a strictly upper triangular $n \times n$ matrix, such that for $1 \le i < n$, $i < j \le n$, $Y_{ij} = \|x_i - x_j\|_2^2$.

3: $Z := Y + \Delta$

4: Publish $Z$.



Algorithm 4 RecoverDistanceNA 

Input: $n \times n$ matrix $Z$ published in a privacy-preserving manner; Noise parameter $\sigma$; Indices $a$, $b$ of the
desired users (assume $a < b$ wlog).

Output: Estimated squared distance between users $a$ and $b$.

1: Output $\mathrm{dist}^2_{\mathrm{NA}}(\mathrm{user}_a, \mathrm{user}_b) = Z_{a,b}$.



Similarly to the analysis in Section 3.2.1, Algorithm 3 preserves privacy if $\sigma \ge \sqrt{n \cdot 2(\ln(\tfrac{1}{2\delta}) + \epsilon)}/\epsilon$ if
$\delta < 1/2$, since a change in a single bit of $X$ causes $n$ changes in the matrix $Y$, each of magnitude one.

Following the analysis of the previous section, consider the variance of the estimator $\mathrm{dist}^2_{\mathrm{NA}}$. Since it is
obtained by adding Gaussian noise drawn from $\mathcal{N}(0, \sigma^2)$, it is exactly $\sigma^2$:

\[
\operatorname{Var}(\mathrm{dist}^2_{\mathrm{NA}}) = \sigma^2 = \Theta(n\ln(1/\delta)/\epsilon^2).
\]

Notice that the variance of the estimator is linear in the number of users $n$ (i.e., rows of the matrix $X$).
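
Algorithms 3 and 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are invented here, users are indexed from 0, the noise scale is set to the smallest $\sigma$ allowed by the condition above, and Python's `random` module stands in for a noise source (a deployment would need a cryptographically sound sampler).

```python
import math
import random


def noise_addition_sigma(n, eps, delta):
    """Smallest sigma satisfying sigma >= sqrt(n * 2 (ln(1/(2 delta)) + eps)) / eps."""
    if not 0 < delta < 0.5:
        raise ValueError("the condition above requires 0 < delta < 1/2")
    return math.sqrt(n * 2 * (math.log(1 / (2 * delta)) + eps)) / eps


def noise_addition(X, eps, delta, rng=None):
    """Algorithm 3 sketch: noisy squared distances in a strictly upper triangular matrix."""
    if rng is None:
        rng = random.Random()
    n = len(X)
    sigma = noise_addition_sigma(n, eps, delta)
    Z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # For Boolean rows the squared l2 distance equals the Hamming distance.
            dist2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            Z[i][j] = dist2 + rng.gauss(0.0, sigma)
    return Z, sigma


def recover_distance_na(Z, a, b):
    """Algorithm 4 sketch: read the published entry (0-based indices, order-insensitive)."""
    if a > b:
        a, b = b, a
    return Z[a][b]


# Toy data (assumed): three users, three attributes.
X = [[0, 1, 1], [1, 1, 0], [0, 0, 0]]
Z, sigma = noise_addition(X, eps=1.0, delta=0.01, rng=random.Random(1))
print(sigma, recover_distance_na(Z, 0, 2))
```

Even on this toy input the noise scale is large relative to the distances, which reflects the linear dependence of the variance on $n$ noted above.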

4.2 Comparison between PrivateProjection and NoiseAddition 

We use variance of the estimators to compare the accuracy of two methods for release of privacy-preserving
pairwise distances. Recall that

\[
\operatorname{Var}[\mathrm{dist}^2_{\mathrm{PP}}(x, y)] = 2\|x - y\|_2^4/k + 8\sigma_{PP}^2\|x - y\|_2^2 + 8\sigma_{PP}^4 k,
\]

where $\sigma_{PP} = \Theta(\sqrt{\ln(1/\delta)}/\epsilon)$ and $k$ is the target dimension of the projection matrix. The squared distance $\|x - y\|_2^2$
for binary vectors $x$ and $y$ is equal to their Hamming distance, and does not exceed the sum of their weights.
Let the maximal weight of a user vector be $\mu$. For the target dimension $k = \Theta(\mu)$ (notice that it does not
have to be optimal for the given $\|x - y\|_2$):

\[
\operatorname{Var}[\mathrm{dist}^2_{\mathrm{PP}}(x, y)] = 2\|x - y\|_2^4/k + 8\sigma_{PP}^2\|x - y\|_2^2 + 8\sigma_{PP}^4 k = O(\sigma_{PP}^4 \cdot \mu).
\]

As argued in the previous section,

\[
\operatorname{Var}(\mathrm{dist}^2_{\mathrm{NA}}) = \Theta(n\ln(1/\delta)/\epsilon^2),
\]

independent of $x$ and $y$. We see that the variance of the NoiseAddition method is larger than the
variance of PrivateProjection if $\mu\sigma_{PP}^2 = o(n)$. In other words, the PrivateProjection method is
superior (in terms of the variance of the estimator) if the maximal weight of a user vector is much smaller
than the total number of users.
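
The comparison can be made concrete with a small calculation. The sketch below is illustrative only. The exact calibration of $\sigma_{PP}$ is given by the privacy theorem earlier in the paper and is not restated in this section, so it is passed in as a parameter; the example borrows the setting $\sigma_{PP} = 2\sqrt{\ln n}/\epsilon$ with $\delta = 1/n$ that the paper uses in its later comparison. The function names and the numbers are assumptions.

```python
import math


def pp_variance(r2, k, sigma_pp):
    """PrivateProjection variance, with r2 = ||x - y||_2^2."""
    return 2 * r2 ** 2 / k + 8 * sigma_pp ** 2 * r2 + 8 * sigma_pp ** 4 * k


def na_variance(n, eps, delta):
    """NoiseAddition variance sigma^2, using the smallest sigma allowed in Section 4.1."""
    return n * 2 * (math.log(1 / (2 * delta)) + eps) / eps ** 2


def preferred_method(r2, k, sigma_pp, n, eps, delta):
    """Name of the method whose estimator has the smaller variance."""
    if pp_variance(r2, k, sigma_pp) < na_variance(n, eps, delta):
        return "PrivateProjection"
    return "NoiseAddition"


# Assumed example: maximal weight mu = 50, k = mu, squared distance 40, eps = 1.
for n in (1_000, 1_000_000):
    sigma_pp = 2 * math.sqrt(math.log(n)) / 1.0   # setting borrowed from Section 4.3.3
    print(n, preferred_method(40, 50, sigma_pp, n, eps=1.0, delta=1.0 / n))
```

With these numbers the preference flips between the two values of $n$, in line with the condition $\mu\sigma_{PP}^2 = o(n)$.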
4.3 Randomized response



We next describe a technique known as randomized response studied by Warner in the 1960s [32]. Random- 
ized response is a natural, alternate solution to computing privacy-preserving user sketches. We compare 
this solution to our PrivateProjection (PP) approach. 

As before, we describe two separate algorithms: one technique for publishing data in a way that preserves 
privacy and another technique for estimating the squared $\ell_2$ distance between the original vectors given only
the perturbed, private vectors. We show that randomized response's strength is in preserving large distances 
between users, whereas the strength of PP is in preserving small distances. Given the potential applications 
that we consider, of user segmentation and finding users near a given user, we conclude that PP is a more 
favorable solution. 

4.3.1 Privacy guarantees 

The algorithm suggested in the randomized response literature for preserving privacy is quite simple: Each
bit of a user's vector is flipped with probability $p$ (Algorithm 5). Observe that if $p = 1/2$ then the technique
achieves perfect privacy, since any vector is equally likely to be published. However, publishing a random
vector is worthless. On the other hand, if each bit is flipped with probability slightly less than $1/2$, as in
Algorithm 5, then one can show that some privacy is still preserved and yet the perturbed vectors can still
be used to estimate the actual distance between vectors.



Algorithm 5 RandomizedResponse 

Input: Boolean $n \times d$ matrix $X$ whose rows correspond to people and columns correspond to attributes;
Privacy parameter $p < 1/2$.

Output: Privacy-preserving $n \times d$ matrix $\tilde{X}$.

1: $\tilde{X}_{ij} := \begin{cases} X_{ij} & \text{with probability } 1 - p \\ 1 - X_{ij} & \text{with probability } p \end{cases}$

2: Publish $\tilde{X}$.



We discuss the relationship between the flipping probability $p$ and differential privacy first.

Lemma 3. Algorithm 5 preserves $(\epsilon, 0)$-differential privacy when $\log\frac{1-p}{p} \le \epsilon$, or equivalently when $p \ge \frac{1}{1 + e^{\epsilon}}$.

The proof [26, 32] follows by considering two candidate vectors $x$ and $x'$ that differ in only one bit position,
and showing that the ratio of the probability that $\tilde{x}$ is published given $x$ to the probability that $\tilde{x}$ is published
given $x'$ is at most $\frac{1-p}{p}$. Setting this value to at most $e^{\epsilon}$ per the definition of differential privacy yields the
lemma.
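
The publishing step and the relation between $p$ and $\epsilon$ can be illustrated as follows. This is a minimal sketch: the function names are chosen here for illustration, and Python's `random` module is used only as a stand-in for a suitable source of randomness.

```python
import math
import random


def min_flip_probability(eps):
    """Smallest flip probability allowed by Lemma 3: p = 1 / (1 + e^eps)."""
    return 1.0 / (1.0 + math.exp(eps))


def randomized_response(X, p, rng=None):
    """Algorithm 5 sketch: flip each bit of X independently with probability p."""
    if not 0 <= p < 0.5:
        raise ValueError("flip probability must satisfy 0 <= p < 1/2")
    if rng is None:
        rng = random.Random()
    return [[bit ^ 1 if rng.random() < p else bit for bit in row] for row in X]


# Toy data (assumed): two users, four attributes, eps = 1.
p = min_flip_probability(1.0)
X = [[0, 1, 1, 0], [1, 0, 0, 0]]
print(p, randomized_response(X, p, rng=random.Random(0)))
```

At the boundary value of $p$, the ratio $(1 - p)/p$ equals $e^{\epsilon}$ exactly, so any larger $p$ below $1/2$ also satisfies the lemma.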

4.3.2 Utility guarantees 

We now demonstrate that a third party equipped with the perturbed, private vectors published by Algorithm 5
can still approximate the squared $\ell_2$ distance between pairs of users, via Algorithm 6. The algorithm
first computes the squared $\ell_2$ distance between the perturbed representations, and then accounts for the
systemic distortion due to perturbation.

Algorithm 6 RecoverDistanceRR



Input: $n \times d$ matrix $\tilde{X}$ published in a privacy-preserving manner using Algorithm 5; Privacy parameter $p$
used; Indices $a$ and $b$ of the desired users.

Output: Estimated squared distance between users $a$ and $b$ before perturbation.

1: Let $x$ and $y$ represent vectors corresponding to users $a$ and $b$ before randomized response perturbation,
and let $\tilde{x}$ and $\tilde{y}$ be the corresponding vectors after perturbation in the published matrix $\tilde{X}$.

2: Output $\mathrm{dist}^2_{\mathrm{RR}}(\mathrm{user}_a, \mathrm{user}_b) = \dfrac{\|\tilde{x} - \tilde{y}\|_2^2 - 2p(1 - p)d}{(1 - 2p)^2}$.



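Theorem 3. Algorithms 5 and 6 satisfy the following utility guarantees:

1. $\mathrm{dist}^2_{\mathrm{RR}}$ is an unbiased estimator of $\|x - y\|_2^2$:

\[
\mathrm{E}[\mathrm{dist}^2_{\mathrm{RR}}(x, y)] = \|x - y\|_2^2.
\]

2. Deviations on squared distances are bounded as follows:

\[
\big|\mathrm{dist}^2_{\mathrm{RR}}(x, y) - \|x - y\|_2^2\big| \le \frac{\sqrt{d\log(2/\delta_{RR})}}{\sqrt{2}\,(1 - 2p)^2}
\]

with probability at least $1 - \delta_{RR}$.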



Proof of Theorem 3. Let $x$ and $y$ be two vectors which after going through the randomized response process
yield perturbed vectors $\tilde{x}$ and $\tilde{y}$. Let $w = \|x - y\|_2^2$. We prove that $\mathrm{dist}^2_{\mathrm{RR}}(y, x)$ is an unbiased estimate for
$w$ and is tightly concentrated around $w$.

Claim 1. Assume wlog that $x$ and $y$ differ in the first $w$ bits and agree on the remaining $d - w$ bits.
In the first $w$ bits, $\mathrm{E}[\|\tilde{x} - \tilde{y}\|_2^2]$ is the expected number of positions where neither $x$ nor $y$ get flipped or
both get flipped. In the remaining $d - w$ positions, $\mathrm{E}[\|\tilde{x} - \tilde{y}\|_2^2]$ is the expected number of positions where
one gets flipped and not the other. Consequently, $\mathrm{E}[\|\tilde{x} - \tilde{y}\|_2^2] = ((1 - p)^2 + p^2)w + 2p(1 - p)(d - w) =
(1 - 2p)^2 w + 2p(1 - p)d$. Thus

\[
\mathrm{E}[\mathrm{dist}^2_{\mathrm{RR}}(x, y)] = \frac{(1 - 2p)^2 w + 2p(1 - p)d - 2p(1 - p)d}{(1 - 2p)^2} = w = \|y - x\|_2^2.
\]
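
Algorithm 6 and the unbiasedness argument of Claim 1 can be illustrated with a short simulation. This is a sketch under assumed parameters ($d = 100$, $w = 20$, $p = 0.3$); `flip_bits` and `recover_distance_rr` are names chosen here, and the perturbation is re-implemented locally so that the example is self-contained.

```python
import random
import statistics


def flip_bits(v, p, rng):
    """Randomized response on one vector: flip each bit independently with probability p."""
    return [b ^ 1 if rng.random() < p else b for b in v]


def recover_distance_rr(x_pert, y_pert, p):
    """Algorithm 6 sketch: (||x~ - y~||^2 - 2 p (1 - p) d) / (1 - 2p)^2."""
    d = len(x_pert)
    observed = sum((a - b) ** 2 for a, b in zip(x_pert, y_pert))
    return (observed - 2 * p * (1 - p) * d) / (1 - 2 * p) ** 2


# x and y differ in the first w bits and agree on the remaining d - w bits.
rng = random.Random(0)
d, w, p = 100, 20, 0.3
x = [0] * d
y = [1] * w + [0] * (d - w)
estimates = [
    recover_distance_rr(flip_bits(x, p, rng), flip_bits(y, p, rng), p)
    for _ in range(4000)
]
print("true squared distance:", w)
print("mean of estimates:    ", statistics.mean(estimates))
```

Individual estimates can be far from $w$, and may even be negative, but their average should approach $w$ as the number of repetitions grows.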

Claim 2. Observe that for any two bit values, the probability that the distance between them remains
unchanged is $q = p^2 + (1 - p)^2$, corresponding to both bits either being flipped or both remaining unchanged.
Accordingly, the probability that the distance between any two bits changes is $1 - q$.

Let $I_i$ denote the indicator random variable corresponding to the distance between the $i$th bit of $x$ and $y$
remaining unchanged despite perturbation. Then each $I_i$ can be viewed as an independent Bernoulli trial,
with $\Pr[I_i = 1] = q$.

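Let $a = \sum_{i=1}^{w} I_i$ and $b = \sum_{i=w+1}^{d} (1 - I_i)$. In other words, let $a$ be the number of bit positions among the
first $w$ bits in which the distance between bits remains unchanged, i.e., remains 1, and let $b$ be the number
of bit positions among the remaining $d - w$ bits, where the distance between bits changed (i.e., increases to
1), due to perturbation introduced by Algorithm 5. Then $\|\tilde{x} - \tilde{y}\|_2^2 = a + b$.

By Hoeffding's inequality (applied to $d$ independent random variables, each of which takes values in $[0, 1]$):

\[
\Pr\Big[\big|\|\tilde{x} - \tilde{y}\|_2^2 - \mathrm{E}[\|\tilde{x} - \tilde{y}\|_2^2]\big| > \gamma\Big] = \Pr\big[|a + b - \mathrm{E}[a + b]| > \gamma\big] \le 2\exp\Big(-\frac{2\gamma^2}{d}\Big).
\]

Therefore,

\[
\begin{aligned}
\Pr\Big[\big|\mathrm{dist}^2_{\mathrm{RR}}(y, x) - \|y - x\|_2^2\big| > \gamma\Big]
&= \Pr\Big[\Big|\frac{\|\tilde{y} - \tilde{x}\|_2^2 - 2dp(1 - p)}{(1 - 2p)^2} - w\Big| > \gamma\Big] \\
&= \Pr\Big[\Big|\frac{\|\tilde{y} - \tilde{x}\|_2^2 - (1 - 2p)^2 w - 2dp(1 - p)}{(1 - 2p)^2}\Big| > \gamma\Big] \\
&= \Pr\Big[\big|\|\tilde{y} - \tilde{x}\|_2^2 - \mathrm{E}[\|\tilde{y} - \tilde{x}\|_2^2]\big| > \gamma(1 - 2p)^2\Big] \\
&\le 2\exp\Big(-\frac{2\gamma^2(1 - 2p)^4}{d}\Big).
\end{aligned}
\]

Plugging in $\gamma = \sqrt{\dfrac{d\log(2/\delta_{RR})}{2(1 - 2p)^4}}$, we obtain the desired inequality.

□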
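
The deviation bound of Theorem 3 is easy to evaluate numerically, which helps to see how quickly it degrades as $p$ approaches $1/2$. The sketch below is illustrative: the function names and the parameter values are assumptions, and the second function simply restates the Hoeffding tail from the proof so that the plug-in step can be checked.

```python
import math


def rr_deviation_bound(d, p, delta_rr):
    """Theorem 3 bound: sqrt(d log(2 / delta_RR)) / (sqrt(2) (1 - 2p)^2)."""
    return math.sqrt(d * math.log(2 / delta_rr)) / (math.sqrt(2) * (1 - 2 * p) ** 2)


def hoeffding_failure_probability(d, p, gamma):
    """Tail bound from the proof: 2 exp(-2 gamma^2 (1 - 2p)^4 / d)."""
    return 2 * math.exp(-2 * gamma ** 2 * (1 - 2 * p) ** 4 / d)


# Assumed example: d = 1000 attributes, failure probability 0.05.
for p in (0.1, 0.3, 0.45):
    gamma = rr_deviation_bound(1000, p, 0.05)
    print(p, round(gamma, 1), hoeffding_failure_probability(1000, p, gamma))
```

Plugging the bound back into the tail expression recovers $\delta_{RR}$, and the factor $(1 - 2p)^{-2}$ dominates the growth of the bound for flip probabilities close to $1/2$.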



4.3.3 Comparison between PrivateProjection and RandomizedResponse 

In Theorems 2 and 3 we showed that both PrivateProjection (PP) and RandomizedResponse (RR)
algorithms preserve the expected squared distance between pairs of users, and computed the bounds on how
likely it is that the actual values are concentrated around the expectation.

Since both concentration bounds are known to be tight in practice (see Venkatasubramanian and Wang [31]
for an empirical study of the Johnson-Lindenstrauss transform), we follow the standard practice of comparing
the concentration guarantees to determine which of the two privacy-preserving algorithms would better
preserve utility.

Consider the case when $k$ is fixed. When the squared distance is $O(\sqrt{k})$, we show that both algorithms
are inaccurate; when the squared distance is between $\sqrt{k}$ and $\sqrt{dk}/\epsilon^2$, our PrivateProjection algorithm is
more accurate; and when the squared distance is larger than $\sqrt{dk}/\epsilon^2$, RandomizedResponse is preferable.
To see why, consider that it follows from Lemma 3 that for Algorithm 5 to satisfy $(\epsilon, 0)$-differential privacy,
the flip probability $p$ has to be such that $p \ge \frac{1}{1 + e^{\epsilon}} \approx 1/2 - \epsilon/4$, which is accurate to within 10% for
$\epsilon < 1$. For the purpose of comparison we choose $\sigma = 2\sqrt{\ln(n)}/\epsilon$, resulting in $(\epsilon, 1/n)$-differential privacy
according to Theorem 1. This is a conservative setting of the privacy parameters, roughly corresponding to
a single violation of the $(\epsilon, 0)$-differential privacy guarantee over $n$ users. Then, equating failure probabilities
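$\mu = \delta_{RR} = \delta_{JL} = \delta_N = \delta_{\chi^2}$ for some $\mu < 1$, we have $\lambda_{JL} = \Theta(\sqrt{\log(1/\mu)}/\sqrt{k})$, $\lambda_{\chi^2} = \Theta(\sqrt{\log(1/\mu)})$,
$\lambda_N = \Theta(\sqrt{\log(1/\mu)})$ for some $k$. Fix two vectors $x, y \in \{0,1\}^d$ and compare the error of the estimates $\mathrm{dist}^2_{\mathrm{PP}}(x, y)$
and $\mathrm{dist}^2_{\mathrm{RR}}(x, y)$ of the true squared $\ell_2$-distance $\|x - y\|_2^2$. As long as $\sqrt{k} < \|x - y\|_2^2 < \sqrt{dk}/\epsilon^2$, which controls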



the first term of the bound (jl]), and ^(Inn)^ <C d, which bounds the second term, the estimate distpp(x, y) 
is closer to the true distance, and hence PrivateProjection outperforms RandomizedResponse. The 
exact constants separating these regions depend on the privacy parameters and failure probabilities S's. 

Note, however, that the target dimension $k$ is not fixed, but rather is selected by the curator. It can be
selected with the goal of finding the sweet spot between preserving privacy and the utility of a given algorithm.
Alternatively, several sketches with different values of $k$ can be released so as to preserve distances at multiple
scales, each consuming its share of the privacy budget.

5 Discussion 

In this section, we describe how user sketches released in a privacy-preserving way can be used by third 
parties, and conclude by discussing the limitations of sketches and our privacy guarantees. 

5.1 Applications 

We begin by re-iterating exactly what can be safely published: 

1. The $d$ attribute meanings in the original vector space, assuming the meanings themselves are not
sensitive, or the ones that are published via a method similar to the one in [20] or [16].

2. The Johnson-Lindenstrauss projection matrix $P$.

3. For each user $x$, their userID, together with their perturbed sketch $xP + \Delta$.

There are several actions that the third party can perform with this published information, depending on 
what kind of additional information the third party possesses about the users and the goal the third party 
is trying to achieve. 

Segmentation: User sketches can be segmented via some clustering algorithm and then information 
known to the third party about some members of the cluster can be generalized to the rest of the cluster. 
There is convincing evidence that segmentation of users into clusters is effective in some contexts Pll25ll27ll33j . 

Nearest Neighbors: Another application of perturbed sketches is finding nearest neighbors. For ex- 
ample, finding users most similar to an already known one can be useful in the context of online dating, and 
product and movie recommendations. 
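
A nearest-neighbor query over published sketches might look as follows. This is a hedged sketch rather than the paper's procedure: it assumes the distance estimate $\|x' - y'\|_2^2 - 2k\sigma^2$ for sketches $x' = xP + \Delta_x$ and $y' = yP + \Delta_y$, which matches the decomposition $Z_1 + Z_2 + Z_3$ used in the utility proof, and the function names, user IDs and sketch values are invented for illustration.

```python
def estimate_sq_distance_pp(sketch_a, sketch_b, sigma):
    """Assumed PrivateProjection estimate: ||a' - b'||^2 - 2 k sigma^2."""
    k = len(sketch_a)
    raw = sum((a - b) ** 2 for a, b in zip(sketch_a, sketch_b))
    return raw - 2 * k * sigma ** 2


def nearest_neighbors(sketches, query_id, sigma, top=1):
    """Return the `top` user IDs whose sketches are estimated closest to the query user."""
    query = sketches[query_id]
    scored = [
        (estimate_sq_distance_pp(query, sketch, sigma), user_id)
        for user_id, sketch in sketches.items()
        if user_id != query_id
    ]
    scored.sort()
    return [user_id for _, user_id in scored[:top]]


# Made-up published sketches with k = 2 (illustrative values only).
sketches = {"u1": [0.0, 0.0], "u2": [1.0, 0.0], "u3": [5.0, 5.0]}
print(nearest_neighbors(sketches, "u1", sigma=0.5))
```

Since the correction term $2k\sigma^2$ is the same for every pair, it does not affect the ranking of neighbors; it matters only when the estimated distances themselves are of interest, for example when applying a distance threshold.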

5.2 Limitations 

Although our algorithm offers a method for privacy-preserving sharing of user data with third parties in a 
way that enables user-user distance computations, there are other tasks for which the user data shared using 
our method would not be useful to third parties. We also discuss the limitations of the privacy protections 
we provide.

5.2.1 Utility Limitations 

An important limitation of our work from the utility perspective is that the dimensions of the user sketches 
are impossible to interpret. As a consequence, the only way for a third party to select users satisfying a 
particular attribute is to project the vector corresponding to this attribute in the higher-dimensional space 
to the lower-dimensional space, and then perform the distance computation between user sketches and the 
obtained lower-dimensional attribute vector. However, as explained in §4.3.3, this computation would fall
into the range of squared distance values for which both PrivateProjection and RandomizedResponse 
perform poorly. 

Furthermore, the proposed computation of user sketches weighs all attributes equally, which may not 
be desirable for third parties who want to prioritize similarity between users in some of the attributes over 
others. Computing multiple projections, each based on a different subset or weighting of the attributes, would
require use of additional privacy budget for each projection, as well as necessitate precluding the possibility
of collusion among third parties.

Another limitation, which is a challenge for much of the privacy literature, is that our sketches provide
a static snapshot of user data, and would require additional privacy budget in order to update them as the 
user information changes. The work of [13] offers directions for possibly overcoming this challenge. 

5.2.2 Privacy Limitations 

While our work takes an important step forward, privacy is more complex than ensuring that a third party 
cannot infer a particular attribute of a user. For example, if many of the attributes are correlated or 
representative of a higher-level user feature, then our techniques do not prevent a third party from inferring 
that. In other words, our guarantees apply to a constant number of attributes, but not to a persistent trend 
that exists in the data. Depending on the context, it may be more powerful to first categorize the attributes 
into a coarser granularity prior to producing perturbed sketches. 

Finally, as we explained in the Introduction, the goal of our work is to enable third parties to perform 
distance computations and clustering on users. Clearly, our work is not relevant for settings where such 
computations and privacy are fundamentally at odds, i.e., scenarios where the underlying data is so sensitive 
that even the ability to identify that two users are similar constitutes a privacy violation. 

6 Related Work 

Liu et al. [22] introduce and motivate the problem of releasing data to third parties with a goal that the 
original sensitive information cannot be inferred while preserving analytic properties of the data, such as 
inner product and Euclidean distance computations. Their approach is based on random projection to a 
lower-dimensional subspace using a projection matrix drawn from a distribution unknown to the adversary. 
The key distinction from our work is that they do not utilize an operational definition for what it means to 
protect the privacy of the data, and therefore, as they point out, there are scenarios in which an adversary can 
find approximations to original data (e.g., if the data is restricted to Boolean domain or adversary possesses 
certain background knowledge). Follow-up works [14, 30] propose concrete attacks and demonstrate the
vulnerabilities of the approach. Our use of differential privacy and addition of properly calibrated random
noise after the projection enables us to provide a rigorous privacy guarantee, as well as gain insight into the
change in utility depending on projected dimension used. Mukherjee et al. [28] propose enabling
distance-based mining algorithms over private data using Fourier-related transforms, but their approach has the same
drawbacks as Liu et al. [22].

We discussed randomized response in §4.3 and compared its performance with our method in §4.3.3. In
terms of privacy, randomized response offers a slightly better privacy guarantee. However, from a utility 
perspective, randomized response does not preserve "small" distances as effectively as the present work. 
Concretely, for users that are less than \/Wt distance apart, our method provides stronger guarantee than 
randomized response. Since third parties are likely interested in preserving small distances, we believe that 
our approach is more suitable for typical data mining applications. 

From a differential privacy perspective, alternative solutions could be used to attack our problem. For
example, Blum et al. [5] give a method of running k-Means on a private data set maintained by a trusted
administrator. Their goal is to produce k cluster centers that are not too far from the k cluster centers
that k-Means would produce if the algorithm had access to the private data. Our goal differs in that we
seek to publish data (or enable its utilization) along with the userIDs and enable identification of users who
belong to the same cluster. Our noisy sketches can also be used for other distance-based computations
such as nearest neighbors. Finally, another possible direction is not to publish any data and only allow
black-box queries to the first-party data provider, where the answers to such queries are perturbed [TUI23].
This approach may place considerable burden on the first-party data provider, and consumes privacy budget
with each query posed.



7 Conclusion and Future Work 

We proposed a viable solution to the challenge of publishing user data for enabling computation of distance 
between users, without revealing the values of user data attributes. The key insight behind our technique is 
that by projecting users to a lower-dimensional space, we can limit the amount of noise we add to each user's 
data, while also reaping the benefit of preserving distances. We also compared our proposed solution to other 
candidate solutions, such as directly adding noise to the pairwise distances or adding noise to each attribute 
of a user, and showed that our method is preferable for potential applications such as user segmentation and 
nearest neighbor search. 

There is ample opportunity to improve upon our results as the problem of privacy-preserving data sharing
with third parties is naturally more complex than sharing in a way that enables distance computations. For
example, third parties would benefit from data that enables computation of other data-mining primitives
and from the ability to operate on dynamically changing data [13].



A Appendix

Lemma 4. Let $X_1, \dots, X_n$ be i.i.d. variables drawn from $\chi^2_k$ (chi-squared distribution with $k$ degrees of
freedom). For any $x >$

Proof. We use the bound due to Laurent and Massart (Lemma 1) on the tail probability of the chi-squared
distribution:

\[
\Pr[X \ge k + 2\sqrt{kx} + 2x] \le \exp(-x),
\]

where $X \sim \chi^2_k$. We establish the claim by taking the union bound over $n$ independent variables and observing
that

for $x > 0$.

□



Lemma 5. Suppose $X$ is drawn from $\mathcal{N}(0, \sigma^2)$. Then




The second bound is stronger than the first when $x > 0.8\sigma$.

Proof. Expressing the tail probabilities in terms of the CDF of the standard Normal distribution, denoted as
$\Phi$, we have (the first bound):

\[
\Pr[X < -x] = \Phi(-x/\sigma), \qquad \Pr[X > x] = 1 - \Phi(x/\sigma),
\]

and thus

\[
\Pr[|X| < x] = 1 - (\Phi(-x/\sigma) + 1 - \Phi(x/\sigma)) = \Phi(x/\sigma) - \Phi(-x/\sigma).
\]



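Alternatively (the second bound),

\[
\Pr[|X| < x] = 1 - 2\Pr[X > x] \ge 1 - \frac{2\sigma}{x\sqrt{2\pi}}\exp\Big(-\frac{x^2}{2\sigma^2}\Big).
\]

The second bound is stronger than the first as long as

\[
\frac{2\sigma}{x\sqrt{2\pi}} < 1,
\]

which holds when $x > 0.8\sigma$. □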
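Lemma 6. Let $X$ be drawn from an arbitrary distribution over $\mathbb{R}^k$ and $Y \sim \mathcal{N}^k(0, \sigma^2)$, independent of $X$. Then

\[
\langle X, Y\rangle \sim \mathcal{N}(0, \|X\|_2^2\sigma^2).
\]

In particular,

\[
\mathrm{E}[\langle X, Y\rangle] = 0.
\]

Proof. Let $X = (X_1, \dots, X_k)$ and $Y = (Y_1, \dots, Y_k)$. Then

\[
\langle X, Y\rangle = \sum_{i=1}^{k} X_iY_i \sim \sum_{i=1}^{k} X_i\,\mathcal{N}(0, \sigma^2) \sim \sum_{i=1}^{k} \mathcal{N}(0, X_i^2\sigma^2) \sim \mathcal{N}\Big(0, \sigma^2\sum_{i=1}^{k} X_i^2\Big) \sim \mathcal{N}(0, \sigma^2\|X\|_2^2),
\]

by scaling and additive properties of the Gaussian distribution. □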
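
Lemma 6 can be checked numerically for a fixed vector. The following sketch is illustrative only: it conditions on one arbitrary value of $X$ (the vector below is made up), samples $Y$ with independent $\mathcal{N}(0, \sigma^2)$ coordinates, and compares the empirical mean and variance of $\langle X, Y\rangle$ with $0$ and $\|X\|_2^2\sigma^2$. The function name is an assumption.

```python
import random
import statistics


def inner_product_samples(x, sigma, trials, seed=0):
    """Sample <x, Y> for a fixed vector x and Y with i.i.d. N(0, sigma^2) coordinates."""
    rng = random.Random(seed)
    return [sum(xi * rng.gauss(0.0, sigma) for xi in x) for _ in range(trials)]


x = [3.0, -4.0, 12.0]                     # ||x||_2^2 = 169 (made-up vector)
sigma = 0.5
samples = inner_product_samples(x, sigma, trials=20000)
predicted_variance = sum(xi * xi for xi in x) * sigma ** 2
print("empirical mean:    ", statistics.mean(samples))
print("empirical variance:", statistics.variance(samples))
print("predicted variance:", predicted_variance)
```

The check covers only the first two moments for one fixed $X$; the lemma itself is the stronger statement that the whole conditional distribution is Gaussian.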